Abstract



In this paper, we present a novel computational framework for nonlinear dimensionality reduc- 
tion which is specifically suited to process large data sets: the Exploratory Inspection Machine 
(XIM). XIM introduces a conceptual crosslink between hitherto separate domains of machine 
learning, namely topographic vector quantization and divergence-based neighbor embedding approaches.
There are three ways to conceptualize XIM, namely (i) as the inversion of the Exploratory
Observation Machine (XOM) and its variants, such as Neighbor Embedding XOM
(NE-XOM), (ii) as a powerful optimization scheme for divergence-based neighbor embedding 
cost functions inspired by Stochastic Neighbor Embedding (SNE) and its variants, such as t- 
distributed SNE (t-SNE), and (iii) as an extension of topographic vector quantization methods, 
such as the Self-Organizing Map (SOM). By preserving both global and local data structure, 
XIM combines the virtues of classical and advanced recent embedding methods. It permits 
direct visualization of large data collections without the need for prior data reduction. Finally, 
XIM can contribute to many application domains of data analysis and visualization important 
throughout the sciences and engineering, such as pattern matching, constrained incremental 
learning, data clustering, and the analysis of non-metric dissimilarity data. 



1 Motivation 



In this paper, we present a novel computational framework for nonlinear dimensionality reduc- 
tion which is specifically suited to process large data sets: the Exploratory Inspection Machine 
(XIM). The central idea of XIM is to invert our recently proposed variant of the Exploratory 
Observation Machine (XOM) [1, 2, 3], namely the Neighbor Embedding XOM (NE-XOM)
algorithm as proposed in [4] by systematically exchanging the roles of ordering and exploration 
spaces in topographic vector quantization mappings. In contrast to NE-XOM, this is accom- 
plished by calculating the derivatives of divergence-based topographic vector quantization cost 
functions with respect to distance measures in the high-dimensional data space rather than 
derivatives with respect to distance measures in the low-dimensional embedding space.

As an important consequence, we can thus introduce successful approaches to alleviate the 
so-called 'crowding phenomenon' as described by [5] into classical topographic vector quantization
schemes, in which, in contrast to XOM and its variants, data items are associated with
the exploration space and not with the ordering space of topographic mappings. The result- 
ing learning rules for heavy-tailed neighborhood functions in the low-dimensional embedding 
space are significantly different from corresponding update rules for NE-XOM and its variants 
for other divergences. 

XIM extends classical topographic vector quantization methods in that its cost function
not only optimizes 'trustworthiness' in the sense of [6], i.e. small distances in the embedding
space correspond to small distances in the data space. By introducing repulsive forces into
the learning rules, it optimizes 'continuity' as well, such that small distances in the data
space are represented by small distances in the embedding space. This is exactly opposite
to NE-XOM, in which divergence-based cost functions introduce repulsive forces that help to 
optimize 'trustworthiness', whereas the original XOM cost function predominantly optimizes 
'continuity'. 

In the light of these considerations, there are three ways to conceptualize XIM as a general
computational framework for nonlinear dimensionality reduction, namely (i) as the systematic
inversion of the Exploratory Observation Machine (XOM) [1, 2] and its variants,
such as Neighbor Embedding XOM (NE-XOM) [4], (ii) as a powerful optimization scheme for
divergence-based dimensionality reduction cost functions inspired by Stochastic Neighbor Embedding
(SNE) [7] and its variants, such as t-distributed SNE (t-SNE) [5], and (iii) as an extension
of topographic vector quantization methods, such as the Self-Organizing Map (SOM) [8],
where XIM introduces the option to explicitly optimize 'continuity'. Thus, XIM introduces 
novel conceptual crosslinks between hitherto separate domains of machine learning. 

In the remainder of this paper, we first motivate XIM in the context of topographic vector 
quantization. We then derive its learning rule from a cost function which is based on the generalized
Kullback-Leibler divergence between neighborhood functions in the data and embedding
spaces. We further introduce the t-distributed and Cauchy-Lorentz distributed XIM (t-XIM and 
c-XIM) variants which help to address the crowding phenomenon in nonlinear dimensionality 
reduction. After specifying technical details of how to calculate the XIM learning rule by com- 
puting the derivative of its cost function, we discuss algorithmic properties and variants of XIM, 
including batch computation and XIM for non-metric data. After presenting experimental results,
we finally extend XIM beyond the Kullback-Leibler divergence by deriving XIM learning
rules for various other divergence measures known from the mathematical literature. 

2 Topographic Mapping 

We first motivate the Exploratory Inspection Machine (XIM) in the context of topographic vector
quantization techniques. In their simplest form, these algorithms may be seen as methods
that map a finite number of data points $x_i \in \mathbb{R}^D$, $i \in \{1, \dots, N\}$ in an exploration
space $E$ to low-dimensional image points $y_i \in \mathbb{R}^d$ in an ordering space $O$. The assignment is
$x_i \mapsto y_i$ and typically $d \ll D$, e.g. $d = 2, 3$ for visualization purposes. The mapping $x_i \mapsto y_i$
is not calculated directly, rather the data points $x_i$ are represented by so-called 'prototypes'
$w_j \in E \subset \mathbb{R}^D$, $j \in \{1, \dots, M\}$ that are mapped to target points $r_j \in O \subset \mathbb{R}^d$, so-called
'nodes'. These nodes are chosen a priori as a structure hypothesis in the observation space $O$.
Reasonable choices are the location of nodes on a regular lattice structure, e.g. a rectangular 
or hexagonal grid, but no specific limitations apply, i.e. arbitrary geometrical and even random 
arrangements may be considered. 

As in topographic vector quantizers, the goal of XIM is to find prototype positions $w_j$ in
such a way that the tuples $(w_j, r_j) \in E \times O$ represent pairs of reference points for defining a
mapping $\Psi$ which allows the user to find a low-dimensional target point $y_i$ for each data point
$x_i$. Such an explicit mapping based on computed pairs of reference points can be accomplished
by various additional interpolation or approximation procedures, as shown later. 

Each prototype defines a 'receptive field' by a decomposition of $E$ according to some
optimality rule for mapping points to prototypes, e.g. by a criterion of minimal distance to a
specific prototype. Let the spaces $E$ and $O$ be equipped with distance measures $d_E(x, w_j)$ and
$d_O(r_k, r_l)$. A simple way to define a best-match node $r^*$ for a given data vector $x$ is to compute

$$r^*(x) = r_{\phi(x)} = \Psi(w_{\phi(x)}), \qquad (1)$$

where $\phi(x)$ is determined according to the minimal distance criterion

$$d_E(x, w_{\phi(x)}) = \min_j d_E(x, w_j). \qquad (2)$$

In a simple iterative approach, learning of prototypes is accomplished by the online
adaptation rule

$$w_j \leftarrow w_j - \epsilon\, h_\sigma(d_O(r^*, r_j))\, \frac{\partial d_E(x, w_j)}{\partial w_j}, \qquad (3)$$

where $\epsilon > 0$ denotes a learning rate, $d_O$ and $d_E$ distances in the ordering and exploration spaces,
respectively, e.g. squared Euclidean, and $h_\sigma(t)$ a neighborhood cooperativity, which is typically
chosen as a monotonically decreasing function, e.g. a Gaussian

$$h_\sigma(t) = \exp\left(-\frac{t}{2\sigma^2}\right), \qquad \sigma > 0. \qquad (4)$$

However, it should be noted that the distance measures $d_E(x, w_j)$ and $d_O(r_k, r_l)$ are not
confined to squared Euclidean: arbitrary, even non-metric distance measures may be applied.
Likewise, the neighborhood cooperativity $h$ is not restricted to Gaussians: other choices, such
as heavy-tailed distributions, may provide advantages, as discussed below.
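
The online adaptation described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming squared Euclidean distances in both spaces, a Gaussian cooperativity of the form $\exp(-t/(2\sigma^2))$, and the minimal distance criterion for the best match; the function names `gaussian_cooperativity` and `som_online_step` are hypothetical and not taken from the paper.

```python
import numpy as np

def gaussian_cooperativity(t, sigma):
    """Gaussian neighborhood cooperativity exp(-t / (2 sigma^2)) (assumed form)."""
    return np.exp(-t / (2.0 * sigma ** 2))

def som_online_step(x, W, R, eps, sigma):
    """One online prototype update with squared Euclidean d_E and d_O.

    x: data vector (D,); W: prototypes (M, D); R: node positions (M, d).
    Returns the updated prototypes and the best-match index.
    """
    d_E = np.sum((W - x) ** 2, axis=1)        # d_E(x, w_j) for all prototypes
    phi = int(np.argmin(d_E))                 # minimal distance criterion
    d_O = np.sum((R - R[phi]) ** 2, axis=1)   # d_O(r*, r_j) on the node grid
    h = gaussian_cooperativity(d_O, sigma)
    # For squared Euclidean d_E, the gradient w.r.t. w_j is -2 (x - w_j).
    grad = -2.0 * (x - W)
    W_new = W - eps * h[:, None] * grad
    return W_new, phi

# Usage: two prototypes on a one-dimensional chain of two nodes.
W = np.array([[0.0, 0.0], [1.0, 1.0]])
R = np.array([[0.0], [1.0]])
x = np.array([0.2, 0.0])
W_new, phi = som_online_step(x, W, R, eps=0.25, sigma=1.0)
```

With this step size the best-match prototype moves halfway toward the data vector, and its grid neighbor follows with a weight reduced by the cooperativity.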

3 The Exploratory Inspection Machine (XIM) 

It can be shown that for a continuous distribution of inputs, the learning rule (3) cannot be
expressed as the derivative of a cost function [9]. According to [10], one can circumvent this
problem by replacing the minimal distance criterion (2) with a modified best-match node
definition

$$r^*(x) = r_{\phi(x)} = \Psi(w_{\phi(x)}), \qquad (5)$$

where $\phi(x)$ is determined according to

$$\sum_j h_\sigma(d_O(r_{\phi(x)}, r_j))\, d_E(x, w_j) = \min_k \sum_j h_\sigma(d_O(r_k, r_j))\, d_E(x, w_j). \qquad (6)$$

This leads to the cost function

$$E = \int \sum_i \delta_{r^*(x), r_i} \sum_{j=1}^{M} h_\sigma(d_O(r^*(x), r_j))\, d_E(x, w_j)\, p(x)\, dx, \qquad (7)$$

where $\delta$ denotes the Kronecker delta. The derivative of (7) with respect to $w_j$ using the
best-match definition (6) yields the learning rule (3).

Given these settings, we propose to augment the cost function (7) by an additional term
in order to combine fast sequential online learning known from topographic mapping, such
as in learning rule (3), and principled direct divergence optimization approaches, such as in
SNE [7]. To this end, by means of cost function (7) we can define new learning rules based on
the generalized Kullback-Leibler divergence for unnormalized positive measures $p$ and $q$ with
$0 \le p, q \le 1$

$$E_{GKL}(p\|q) = \int p(x) \log\frac{p(x)}{q(x)}\, dx - \int [p(x) - q(x)]\, dx. \qquad (8)$$

We use the cooperativity functions $h_\sigma(d_O(r_i, r_j))$ and $g_\gamma(d_E(x, w_j))$ as positive measures,
in the following abbreviated by $h^{ij}$ and $g^{j}$.
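
For intuition, the generalized Kullback-Leibler divergence can be evaluated on discrete, unnormalized positive measures. The sketch below is a discrete analogue of the integral form, with sums replacing integrals; the function name `generalized_kl` is hypothetical.

```python
import numpy as np

def generalized_kl(p, q):
    """Discrete analogue of the generalized KL divergence.

    Computes sum(p * log(p / q)) - sum(p - q) for strictly positive arrays
    p and q, which need not be normalized.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)) - np.sum(p - q))

# The divergence vanishes for identical measures and is positive otherwise.
same = generalized_kl([0.5, 0.2], [0.5, 0.2])
diff = generalized_kl([0.5], [0.25])
```

The second term is what distinguishes the generalized form from the ordinary Kullback-Leibler divergence: it keeps the expression non-negative even when the two measures have different total mass.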

Inspired by t-SNE [5], neighborhood functions $h^{ij}$ of the low-dimensional embedding space
$O$ can be chosen as a heavy-tailed distribution, e.g. a Student t or Cauchy-Lorentzian, in order
to alleviate the so-called 'crowding problem' which usually occurs in high dimensions related
to the curse of dimensionality. It is important to notice that in XIM the neighborhood function
of the embedding space refers to the ordering space $O$ of the topographic mapping. This marks
a fundamental difference when compared to XOM [11] and its variant NE-XOM [4], in which
the neighborhood function of the embedding space refers to the exploration space $E$ of the
topographic mapping.

Based on these settings, we obtain a novel cost function

where the best-match node $r^*(x)$ for data point $x$ is defined such that

The derivative of this cost function with respect to prototypes $w_j$ is explained in section 4.
It yields the following online learning update rule for a given data vector $x$:

For the special case of a Gaussian neighborhood function in the high-dimensional data space
$E$, i.e. $g^{j} = \exp\left(-\frac{d_E(x, w_j)}{2\gamma^2}\right)$, $\gamma > 0$, the learning rule (12) yields

$$\Delta w_j = \frac{1}{2\gamma^2}\left(h^{*j} - g^{j}\right) \frac{\partial d_E(x, w_j)}{\partial w_j}, \qquad (13)$$

where $h^{*j}$ may be chosen as a monotonically decreasing function, e.g.

(i) as a Gaussian

$$h^{*j} = \exp\left(-\frac{d_O(r^*, r_j)}{2\sigma^2}\right) \quad \text{(XIM)}, \qquad (14)$$

(ii) as a t-distribution

$$h^{*j} = \left(1 + \frac{1}{\sigma}\, d_O(r^*, r_j)\right)^{-\frac{\sigma + 1}{2}} \quad \text{(t-XIM)}, \qquad (15)$$

(iii) as a Cauchy-Lorentz distribution,

$$h^{*j} = \left(1 + \frac{1}{\sigma^2}\, d_O(r^*, r_j)\right)^{-1} \quad \text{(c-XIM)}, \qquad (16)$$

where we call the resulting algorithms XIM, t-XIM, and c-XIM, respectively. Obviously (iii)
can be seen as a special case of (ii).
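
The practical difference between the three choices lies in their tails. The following sketch compares them numerically under assumed parameterizations that are common in the neighbor embedding literature: a Gaussian $\exp(-d/(2\sigma^2))$, a Student t form $(1 + d/\sigma)^{-(\sigma+1)/2}$, and a Cauchy-Lorentz form $(1 + d/\sigma^2)^{-1}$. The function names are hypothetical, and the exact parameterization used for XIM may differ.

```python
import numpy as np

def h_gaussian(d, sigma):
    """Gaussian cooperativity (assumed form exp(-d / (2 sigma^2)))."""
    return np.exp(-d / (2.0 * sigma ** 2))

def h_student_t(d, sigma):
    """Student t cooperativity (assumed form (1 + d/sigma)^(-(sigma+1)/2))."""
    return (1.0 + d / sigma) ** (-(sigma + 1.0) / 2.0)

def h_cauchy(d, sigma):
    """Cauchy-Lorentz cooperativity (assumed form 1 / (1 + d/sigma^2))."""
    return 1.0 / (1.0 + d / sigma ** 2)

# At a large ordering-space distance the heavy-tailed variants keep a
# non-negligible cooperativity, whereas the Gaussian is practically zero.
d_far = 25.0
tail_gauss = h_gaussian(d_far, 1.0)
tail_cauchy = h_cauchy(d_far, 1.0)
```

Under these assumed forms, the Student t and Cauchy-Lorentz functions coincide for $\sigma = 1$, which is one way to read the remark that (iii) is a special case of (ii).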

For the special choice of a squared Euclidean metric in $E$, i.e. $d_E(x, w_j) = (x - w_j)^2$, the
learning rule (13) yields

$$\Delta w_j = -\frac{1}{\gamma^2}\left[h^{*j} - \exp\left(-\frac{(x - w_j)^2}{2\gamma^2}\right)\right](x - w_j). \qquad (17)$$
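
The step from the divergence to the squared Euclidean rule can be retraced as a short worked derivation. It is a sketch under two assumptions that are not quoted from the text: that the per-sample cost is the generalized Kullback-Leibler divergence between $h^{*j}$ and $g^{j}$ summed over all nodes, and that prototypes are updated by $w_j \leftarrow w_j - \epsilon\, \Delta w_j$ with $\Delta w_j$ the gradient of that cost.

```latex
% Assumed per-sample cost for the best-match node r^*:
E(x) = \sum_j \left[ h^{*j} \ln\frac{h^{*j}}{g^{j}} - h^{*j} + g^{j} \right],
\qquad
g^{j} = \exp\!\left(-\frac{d_E(x, w_j)}{2\gamma^2}\right).

% Derivative with respect to g^j, and chain rule through d_E:
\frac{\partial E}{\partial g^{j}} = 1 - \frac{h^{*j}}{g^{j}},
\qquad
\frac{\partial g^{j}}{\partial w_j}
  = -\frac{g^{j}}{2\gamma^2}\,\frac{\partial d_E(x, w_j)}{\partial w_j}.

% Product of the two factors:
\Delta w_j = \frac{\partial E}{\partial w_j}
  = \frac{1}{2\gamma^2}\left(h^{*j} - g^{j}\right)
    \frac{\partial d_E(x, w_j)}{\partial w_j}.

% Squared Euclidean metric: d_E = (x - w_j)^2, so the gradient is -2 (x - w_j):
\Delta w_j = -\frac{1}{\gamma^2}\left(h^{*j} - g^{j}\right)(x - w_j),
\qquad
w_j \leftarrow w_j - \epsilon\,\Delta w_j
  = w_j + \frac{\epsilon}{\gamma^2}\left(h^{*j} - g^{j}\right)(x - w_j).
```

Read this way, a prototype is attracted toward $x$ when its node is close to the best-match node in the ordering space ($h^{*j} > g^{j}$) and repelled when it lies close to $x$ in the data space although its node is far away on the grid ($g^{j} > h^{*j}$).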

4 Derivative of the XIM Cost Function 

With the abbreviations $h^{ij} = h_\sigma(d_O(r_i, r_j))$ and $g^{j} = g_\gamma(d_E(x, w_j))$, we write the derivative of
the cost function equation (9) with respect to a prototype vector $w_k$:

with $r^*(x)$ as defined in equation (10). The latter term yields the learning rule. The
first term vanishes, as can be seen as follows: We use the shorthand notation

Then the best-match node can be expressed as


Hence the additional first term of equation (18) vanishes, because of the following:

$$\int \sum_i \sum_l \delta\big(\Phi(r_l, x) - \Phi(r_i, x)\big) \cdots \Phi(r_i, x)\, p(x)\, dx \;-\; \int \sum_i \sum_l \delta\big(\Phi(r_i, x) - \Phi(r_l, x)\big) \cdots \Phi(r_l, x)\, p(x)\, dx = 0, \qquad (23)$$

because of the symmetry of $\delta$ and the fact that $\delta$ is non-vanishing only if $\Phi(r_i, x) = \Phi(r_l, x)$.
5 Summary of the XIM Algorithm

In summary, the XIM algorithm encompasses the following steps: 



Exploratory Inspection Machine (XIM) 

Input: Data vectors $x$ or pairwise dissimilarities $d_E(x_i, x_j)$ (see section 7)

Initialize: Number of iterations $t_{max}$, learning rate $\epsilon$, cooperativity parameters $\sigma$, $\gamma$,
structure hypothesis $r_j$, initial prototypes $w_j$

For $t < t_{max}$ do

begin

1. Randomly draw a data vector $x$.

2. Find best-match node using equation (10).

3. Compute neighborhood cooperativity $h^{*j}$ (may be pre-computed and retrieved from a
look-up table).

4. Compute neighborhood cooperativity $g^{j}$.

5. Update prototypes according to equations (11) and (12), e.g. using equations (13)-(17).

end

Output: Prototypes $w_j$

Optional: Compute explicit mapping by applying approximation or interpolation schemes
to explicitly calculate a low-dimensional target point $y$ for each data point $x$, see section 6.
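
The steps above can be sketched as a compact NumPy loop. This is an illustration under stated assumptions rather than the reference implementation: it uses squared Euclidean distances, a Cauchy-Lorentz cooperativity $1/(1 + d_O/\sigma^2)$ in the ordering space, a Gaussian cooperativity in the data space, the simple minimal distance criterion for the best match, constant parameters without annealing, and an update of the form $w_j \leftarrow w_j + (\epsilon/\gamma^2)(h^{*j} - g^{j})(x - w_j)$. The function name `xim_online` and the initialization on random data vectors are hypothetical choices.

```python
import numpy as np

def xim_online(X, R, n_iter=2000, eps=0.05, sigma=1.0, gamma=1.0, seed=0):
    """Sketch of an online c-XIM loop (assumed forms, no annealing).

    X: data (N, D); R: node positions (M, d), i.e. the structure hypothesis.
    Returns the prototype matrix W of shape (M, D).
    """
    rng = np.random.default_rng(seed)
    # Assumption: initialize prototypes on randomly chosen data vectors.
    W = X[rng.choice(len(X), size=len(R), replace=True)].astype(float)
    for _ in range(n_iter):
        x = X[rng.integers(len(X))]                  # step 1: draw a data vector
        d_E = np.sum((W - x) ** 2, axis=1)
        best = int(np.argmin(d_E))                   # step 2: best match (minimal distance)
        d_O = np.sum((R - R[best]) ** 2, axis=1)
        h = 1.0 / (1.0 + d_O / sigma ** 2)           # step 3: cooperativity in ordering space
        g = np.exp(-d_E / (2.0 * gamma ** 2))        # step 4: cooperativity in data space
        # step 5: attraction where h > g, repulsion where g > h
        W += (eps / gamma ** 2) * (h - g)[:, None] * (x - W)
    return W

# Usage: 50 random points in 3 dimensions mapped onto a 3 x 3 node grid.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
gx, gy = np.meshgrid(np.arange(3.0), np.arange(3.0))
R = np.column_stack([gx.ravel(), gy.ravel()])
W = xim_online(X, R, n_iter=200)
```

Because the cooperativity in the ordering space depends only on the node grid, it could be tabulated once for all node pairs, as the summary notes for step 3.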

6 Algorithmic Properties and Implementation Issues 

XIM, unlike SNE and many other embedding algorithms, exhibits the interesting property that
it allows a prior structure to be imposed on the projection space, a property that can also
be found in XOM and SOM. Unlike SNE and many other data visualization techniques, which
exhibit a computational and memory complexity that is quadratic in the number of data points,
the complexity of XIM can be easily controlled by the structure hypothesis definition, i.e. the
node grid in the observation space, and is linear in the number of data points and the number
of prototypes. When implementing XIM, the following issues may be considered:

Best-match definition. Although from a theoretical point of view, the best-match definition
should be made according to equation (10), we found that using the much simpler minimal
distance approach according to equation (2) yields comparable results in practical computer
simulations.

Weighting of attractive and repulsive forces. It should be noted that the factor $1/\gamma^2$ in the
update rules (13) and (17) can be seen as a quantity that determines the stepsize of a gradient
descent step on the cost function, which can be treated as a free parameter similarly to the
learning parameter $\epsilon$ in the online learning rule (11). We found it useful to omit the factor
and control the stepsize of gradient descent by annealing of the learning parameter $\epsilon$ only and
to introduce a relative weighting of attractive and repulsive forces in learning rule (11), i.e.

(24)

where the method proved robust for a wide range of values for $\eta \in [0.1, 0.5]$.

Similarly to other topographic mapping approaches, the parameters $\epsilon$, $\sigma$, and $\gamma$ may be
adapted using annealing schemes, e.g. by using an exponential decay

(25)

where $\kappa := \epsilon$, $\kappa := \sigma$, or $\kappa := \gamma$, respectively, and $t \in [t_1, t_{max}]$ denotes the iteration step.

As known from other algorithms, such as SNE or XOM, it can be useful to adapt the variance
of the neighborhood cooperativity in the data space to the local 'density' of neighbors, i.e. to
use a different $\gamma_i$ for each data sample $x_i$ so that an $\alpha$-ball of variance $\gamma_i$ would contain a fixed
number $k$ of neighbors. The rationale of this would be to ensure that data samples in less dense
regions are adequately represented in the calculation of the embedding. Instead of annealing $\gamma$
directly, this number $k$ may be annealed, e.g. according to equation (25) by setting $\kappa := k$. An
alternative approach is to find appropriate $\gamma_i$ by using the 'perplexity' approach proposed for
SNE in [7].
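
An exponential annealing schedule of the kind mentioned above can be written as a one-line function. The form below, a geometric interpolation between a start and an end value, is an assumption commonly used for topographic mappings and is not quoted from the paper; the function name `exponential_decay` and its parameters are hypothetical.

```python
def exponential_decay(t, t_max, kappa_start, kappa_end):
    """Geometric interpolation from kappa_start (t = 0) to kappa_end (t = t_max).

    Assumed schedule: kappa(t) = kappa_start * (kappa_end / kappa_start) ** (t / t_max).
    """
    return kappa_start * (kappa_end / kappa_start) ** (t / t_max)

# Usage: anneal a learning rate from 1.0 down to 0.01 over 100 iterations.
eps_schedule = [exponential_decay(t, 100, 1.0, 0.01) for t in range(101)]
```

The same function can be reused for the learning rate, for either cooperativity width, or for a neighbor count $k$ if that quantity is annealed instead.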

Explicit mapping of data to target points. As the output of XIM is the location of prototypes
$w_j$ in the exploration space $E$, there is no explicit mapping of each data vector onto
its target points. This marks a fundamental difference between XIM and XOM: Since in XOM
each data vector has its own prototype, the final prototype positions already represent the
low-dimensional embeddings for each data point. This does not hold for XIM. However, an explicit
mapping can be accomplished easily by using the tuples $(w_j, r_j) \in E \times O$ as pairs of reference
points for defining a mapping $\Psi$ which allows the user to find a low-dimensional target point
for each data point $x_i$. Such an explicit mapping based on computed pairs of reference points
can be calculated by a wide range of interpolation or approximation procedures. Here, we
specifically mention three approaches that have been successfully applied in similar contexts,
namely Shepard's interpolation [12], generalized radial basis functions softmax interpolation,
e.g. [13], and the application of supervised learning methods. For the latter approach, $(w_j, r_j)$
pairs are used as labeled examples for models of supervised learning, such as feed-forward neural
networks or other function approximators. For technical details of how to use these methods
in a related context, we refer to our previous work [13]. It should be noted that such explicit
mapping approaches can also be used for 'out-of-sample' extension, i.e. for finding embeddings
of new data points that have not previously been used for XIM training.
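
As an illustration of the first of these options, the sketch below applies inverse-distance-weighted (Shepard) interpolation to the reference pairs of prototypes and nodes. The exponent `power=2.0`, the guard value `tiny`, and the function name `shepard_map` are assumptions made for this example.

```python
import numpy as np

def shepard_map(X, W, R, power=2.0, tiny=1e-12):
    """Inverse-distance-weighted interpolation of node positions.

    X: data (N, D); W: prototypes (M, D); R: node positions (M, d).
    Returns target points Y (N, d) as weighted means of the node positions,
    with weights proportional to 1 / distance(x, w_j) ** power.
    """
    # Pairwise Euclidean distances between data points and prototypes.
    dist = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
    weights = 1.0 / np.maximum(dist, tiny) ** power
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ R

# Usage: a data point that coincides with a prototype is mapped onto its node.
W = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
R = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
Y = shepard_map(W, W, R)
```

The same call works unchanged for data points that were not used during training, which corresponds to the out-of-sample extension mentioned above.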

7 Batch XIM and XIM for Non-Metric Data 

Batch XIM algorithms can easily be derived from analyzing the stationary states of learning
rules (12)-(13), where we expect the expectation values for these updates over all data points
to be zero. For example, for learning rule (13), we expect the following condition to be fulfilled
in the stationary state

(26)

where $E_i$ denotes the expectation value over all data points $x_i$ and $f_{r^*(x_i), r_j} = h^{*j} - g^{j}$. For the
squared Euclidean metric, we obtain the batch XIM fixed-point iteration

$$w_j = \frac{\sum_i f_{r^*(x_i), r_j}\, x_i}{\sum_i f_{r^*(x_i), r_j}}. \qquad (27)$$

In analogy to equations (6)-(8) in [14], even simpler and faster batch XIM methods can
easily be obtained by introducing Voronoi sets of the exploration space according to the
best-match criterion (10). For practical implementations, however, the simple minimal distance
criterion (2) may be used.
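
A single pass of such a fixed-point iteration might look as follows. The sketch assumes weights $f = h - g$ with a Cauchy-Lorentz cooperativity in the ordering space and a Gaussian cooperativity in the data space, squared Euclidean distances, and the minimal distance criterion for best matches. Since the weights can be negative, the denominator may become small; the guard `tiny` and the function name `batch_xim_step` are hypothetical additions for this example.

```python
import numpy as np

def batch_xim_step(X, W, R, sigma=1.0, gamma=1.0, tiny=1e-12):
    """One batch fixed-point step w_j = sum_i f_ij x_i / sum_i f_ij (assumed f = h - g).

    X: data (N, D); W: prototypes (M, D); R: node positions (M, d).
    Prototypes whose denominator is numerically zero are left unchanged.
    """
    d_E = np.sum((X[:, None, :] - W[None, :, :]) ** 2, axis=2)        # (N, M)
    best = np.argmin(d_E, axis=1)                                      # best match per data point
    d_O = np.sum((R[best][:, None, :] - R[None, :, :]) ** 2, axis=2)  # (N, M)
    h = 1.0 / (1.0 + d_O / sigma ** 2)      # cooperativity in ordering space
    g = np.exp(-d_E / (2.0 * gamma ** 2))   # cooperativity in data space
    f = h - g
    num = f.T @ X                           # (M, D) weighted sums of data vectors
    den = f.sum(axis=0)                     # (M,) sums of weights
    W_new = W.copy()
    ok = np.abs(den) > tiny
    W_new[ok] = num[ok] / den[ok, None]
    return W_new

# Usage: 40 random points in the plane and a chain of 4 nodes.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
R = np.arange(4, dtype=float)[:, None]
W = X[:4].copy()
W_new = batch_xim_step(X, W, R)
```

In practice one would repeat this step until the prototypes stop changing, possibly while annealing the cooperativity widths.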

Finally, we introduce XIM for non-metric data by combining the batch XIM with the
concept of the generalized median. This can be accomplished in full analogy to SOM for
nonvectorial data according to [15]. However, a fundamental difference is that in forming the
sum of distances for computing the generalized median, the contents of the node-specific data
sublists within a neighborhood set of grid nodes should be weighted by the above cooperativity
$f_{r^*(x_i), r_j}$ rather than $h^{*j}$ alone as in [15]. Likewise, the best-match definition (10) should be
used instead of a simple minimal distance criterion (2).

8 Experiments 

To demonstrate the applicability of XIM, we present results on real-world data, see Figs. 1 and 2.
The data consists of 147 feature vectors in a 79-dimensional space encoding gene expression
profiles obtained from microarray experiments.

Fig. 1 shows visualization results obtained from structure-preserving dimensionality reduction
of gene expression profiles related to ribosomal metabolism, as a detailed visualization of
a subset included in the genome-wide expression data taken from Eisen et al. [16]. The figure
illustrates the exploratory analysis of the 147 genes labeled as '5' (22 genes) and '8' (125
genes) according to the cluster assignment by Eisen et al. [16]. Besides several genes involved
in respiration, cluster '5' (blue) contains genes related to mitochondrial ribosomal metabolism,
whereas cluster '8' (red) is dominated by genes encoding ribosomal proteins and other proteins
involved in translation, such as initiation and elongation factors, and a tRNA synthetase. The
data has been described in [17].

In the c-XIM genome map of Fig. 1A using a square grid of 30 x 30 nodes in the ordering
space, it is clearly visible at first glance that the data consists of two distinct clusters.
Comparison with the functional annotation known for these genes [18] reveals that the map
overtly separates expression profiles related to mitochondrial and to extramitochondrial ribosomal
metabolism. Fig. 1B shows a data representation obtained by a Self-Organizing Map
(SOM) trained on the same data also using a square grid of 30 x 30 nodes in the ordering space.
As can be clearly seen in the figure, SOM cannot achieve a satisfactory cluster separation in the
mapping result as provided by c-XIM in Fig. 1A: Although the genes related to mitochondrial
and to extramitochondrial ribosomal metabolism are collocated on the map, the distinct cluster
structure underlying the data remains invisible if the color coding is omitted. The result obtained
by the Exploratory Observation Machine (XOM) in Fig. 1C as well as the mapping result
obtained by Principal Component Analysis (PCA) in Fig. 2, however, can properly recover the
underlying cluster structure. Note that the result obtained by XIM using Gaussian neighborhood
functions in the ordering space (Fig. 1D) resembles the SOM result in that a clear cluster
separation cannot be observed. The comparison of Figs. 1A and 1D thus demonstrates the benefit
of introducing heavy-tailed neighborhood functions in the ordering space for this example.

A quantitative comparison of several quality measures for the results obtained in the data
set of Figs. 1 and 2 is presented in Tab. 1.

Table 1: Comparative evaluation of quantitative embedding quality measures as known from
the literature for the results obtained for dimensionality reduction of genome expression profiles
related to ribosomal metabolism in Figs. 1 and 2, namely Sammon's error [19], Spearman's $\rho$,
as well as 'trustworthiness' and 'continuity' as defined by [6]. The table encompasses mean
values and standard deviations obtained for 10 runs of each method, where in each run 140
out of 147 data points (approx. 95% of the data) are randomly sub-sampled, i.e. embeddings
and corresponding quality measures are computed for each of the 10 sub-sampled data sets.
Trustworthiness and continuity values are computed as mean values obtained for $k = 1, \dots, 50$
neighbors. The free parameters of all the methods examined in the comparison (except PCA)
were optimized to obtain the best results with regard to the respective quality measure. Explicit
mappings for c-XIM and SOM were computed using Shepard's interpolation [12]. Note that in
comparison to SOM, c-XIM yields competitive results for several quality measures.



Method   Sammon        Spearman's ρ   Trustworthiness   Continuity
SOM      0.18 (0.02)   0.50 (0.05)    0.84 (0.02)       0.85 (0.01)
c-XIM    0.17 (0.02)   0.59 (0.09)    0.87 (0.02)       0.86 (0.02)
XOM      0.19 (0.03)   0.88 (0.02)    0.86 (0.02)       0.90 (0.02)
PCA      0.18 (0.01)   0.86 (0.01)    0.85 (0.01)       0.89 (0.01)
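
To make the first column concrete, the sketch below computes Sammon's error in its usual textbook form, i.e. the sum of squared differences between pairwise distances in the data and embedding spaces, each divided by the data-space distance, normalized by the sum of data-space distances. It is an illustration only and not the evaluation code behind Table 1; the function name `sammon_error` and the guard `tiny` are hypothetical.

```python
import numpy as np

def sammon_error(X, Y, tiny=1e-12):
    """Sammon's stress between data X (N, D) and embedding Y (N, d).

    Uses Euclidean distances over all pairs i < j; data-space distances
    are clipped at `tiny` to avoid division by zero for duplicate points.
    """
    iu = np.triu_indices(len(X), k=1)
    dx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)[iu]
    dy = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)[iu]
    dx = np.maximum(dx, tiny)
    return float(np.sum((dx - dy) ** 2 / dx) / np.sum(dx))

# Usage: a perfect embedding has zero error; doubling all distances gives 1.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
err_same = sammon_error(X, X)
err_scaled = sammon_error(X, 2.0 * X)
```

Note that the measure is sensitive to the overall scale of the embedding, so embeddings are usually rescaled before methods are compared.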



9 XIM Using Other Divergences 

XIM, as described so far, has been introduced based on the generalized Kullback-Leibler
divergence. However, it is important to emphasize that XIM is not restricted to this specific
divergence measure alone, but other divergences can be used easily. However, our work presented
here differs substantially from prior approaches to utilize divergences for nonlinear
dimensionality reduction. In contrast to SNE [7] and t-SNE [5], divergence-based cost functions
for nonlinear dimensionality reduction are not optimized directly, but via iterative incremental
online learning. In contrast to [20], we do not use divergences as distance measures within the
data or the embedding space, but as a dissimilarity measure between these spaces. In contrast to
our previous work [4], which applies divergences to XOM, the XIM method presented in this
paper systematically inverts the quoted approaches by introducing divergences in order to derive
novel topographic vector quantization learning rules. This is accomplished by calculating
the derivatives of divergence-based topographic vector quantization cost functions with respect
to distance measures in the high-dimensional data space rather than derivatives with respect to
distance measures in the low-dimensional embedding space, such as in [4] or [21].

As an important consequence, we can thus introduce successful approaches to alleviate the
so-called 'crowding phenomenon' as described by [5] into classical topographic vector quantization
schemes, in which, in contrast to XOM and its variants, data items are associated with
the exploration space and not with the ordering space of topographic mappings. Specifically,
the deep difference when compared to the XOM approach becomes evident when heavy-tailed
distributions in the low-dimensional embedding space are utilized to address this issue: The
update rules for the resulting t-XIM (and c-XIM) algorithms are significantly different from
the t-NE-XOM update rules as specified in [4] for the Kullback-Leibler divergence and in [21]
for other divergences. The reason is that for t-XIM and c-XIM, in contrast to NE-XOM and
its variants, there is no need to re-compute the derivative of the neighborhood function $h$ of
the low-dimensional embedding space with respect to its underlying distance measure when
computing the new learning rules. Instead, only $h$, not $g$, has to be replaced by a heavy-tailed
distribution. In other words, we can derive learning rules for XIM by directly adopting derivatives
of divergences as known from the mathematical literature and creatively applying them to
a completely different context.

10 Extensions of XIM 

XIM is likely to become even more useful through slight extensions or in combination with other methods
of data analysis and statistical learning. The combination with interpolation and approximation 
schemes for explicit mapping between data and embedding spaces has already been discussed 
above. Other straightforward variants of XIM comprise the iterative use of XOM, local or 
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hierarchical data processing schemes, e.g. by initial vector quantization, and variable choices of 
batch, growing, or speed-up variants analogous to those described in the literature on topology- 
preserving mappings. 

As noted earlier, there are no principal restrictions to input data and structure hypotheses. 
For example, they may be subject to a non-Euclidean, e.g. hyperbolic, geometry, or even rep- 
resent nonmetric dissimilarities. Distances in the observation and exploration spaces may be 
rescaled dynamically during XIM training. These slight extensions should further enhance the 
applicability of XIM to many areas of information processing.

Figure 1: Nonlinear dimensionality reduction of genome expression profiles related to ribosomal
metabolism using (A) Cauchy-Lorentz Exploratory Inspection Machine (c-XIM), (B)
Self-Organizing Map (SOM), (C) Exploratory Observation Machine (XOM), and (D) XIM with 
Gaussian neighborhood functions in the ordering space. For explanation, see text.

Figure 2: Dimensionality reduction of genome expression profiles related to ribosomal
metabolism using Principal Component Analysis (PCA). For explanation, see text.